Telemarketing is a method of direct marketing in which a salesperson solicits prospective customers to buy products or services, either over the phone or through a follow-up face-to-face or web-conferencing appointment. Telemarketing can also include recorded sales pitches programmed to be played over the phone by automatic dialing.
Banks are among the organisations that use telemarketing to sell their products and services. Telemarketing is popular with banks because banking products are sometimes too complicated for some users to understand on their own; it is easier for a target user to understand a product or service when it is explained directly. One advantage of person-to-person telemarketing is that target users can ask questions immediately if they do not understand something.
Nowadays, telemarketing has become negatively associated with various scams and frauds, such as pyramid schemes, and with deceptively overpriced products and services. Fraudulent telemarketing companies are frequently referred to as “telemarketing boiler rooms” or simply “boiler rooms”. Telemarketing is often criticized as an unethical business practice due to the perception of high-pressure sales techniques during unsolicited calls. Telemarketers marketing telephone companies may participate in telephone slamming, the practice of switching a customer’s telephone service without their knowledge or authorization.
As financial institutions, banks care deeply about reputation and branding, and the downsides of telemarketing can damage that reputation. We therefore want to identify which targets will not buy a product or service when the bank offers it via telemarketing. Knowing this helps protect the bank's reputation, because we avoid disturbing targets we already know will not buy.
In this case we will use machine learning to learn patterns and predict a classification label. We will fit several predictive models using training and testing data: Naive Bayes, Decision Tree, and Random Forest.
We will compare the predictions and the performance of each model. These three models are categorized as supervised learning. Supervised learning is popular for pattern prediction: the pattern is learned from training data after an ETL (Extract, Transform, Load) step extracts feature information. Based on these features, new observations are compared against the classification patterns the model learned from labeled data to produce a final prediction.
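The train/predict/evaluate workflow described above can be sketched in miniature (an illustration only, using the built-in `iris` data as a stand-in; the actual models and data are built in the sections below):

```r
# Minimal supervised-learning workflow: split labeled data, fit a model
# on the training portion, then predict labels for unseen test data.
library(rsample)
library(e1071)

set.seed(1)
split <- initial_split(iris, prop = 0.8)          # any labeled dataset works here
model <- naiveBayes(Species ~ ., data = training(split))
pred  <- predict(model, testing(split))           # predicted class labels
mean(pred == testing(split)$Species)              # simple accuracy check
```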
# Data wrangling Library
library(tidyverse)
library(dplyr)
# Visualize data
library(ggplot2)
library(inspectdf)
library(GGally)
library(plotly)
# Naive Bayes
library(e1071)
# Splitting Data
library(rsample)
# Random Forest
library(randomForest)
# Smote for unbalanced data
library(DMwR)
# ROCR
library(ROCR)
# Confusion Matrix
library(caret)
# Decision Tree
library(partykit)
source("matrix_result.R")
source("metrics.R")
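The two sourced helper scripts are not shown here. The following is a plausible sketch consistent with how `matrix_result()` and `metrics()` are called later in this document; the actual implementations in `matrix_result.R` and `metrics.R` may differ:

```r
# Sketch of matrix_result(): flatten a caret confusionMatrix into a
# one-row data frame labeled with the model name.
matrix_result <- function(cm, model_name) {
  data.frame(Model = model_name,
             Accuracy = cm$overall["Accuracy"],
             Sensitivity = cm$byClass["Sensitivity"],
             Specificity = cm$byClass["Specificity"],
             `Pos Pred Value` = cm$byClass["Pos Pred Value"],
             row.names = NULL, check.names = FALSE)
}

# Sketch of metrics(): evaluate recall, accuracy, precision, and
# specificity at a given probability cutoff.
metrics <- function(cutoff, prob, ref, postarget, negtarget) {
  pred <- factor(ifelse(prob >= cutoff, postarget, negtarget),
                 levels = levels(ref))
  cm <- caret::confusionMatrix(pred, ref, positive = postarget)
  c(cm$byClass["Sensitivity"],    # recall
    cm$overall["Accuracy"],
    cm$byClass["Pos Pred Value"], # precision
    cm$byClass["Specificity"])
}
```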
The telemarketing dataset was obtained from the UCI Machine Learning Repository. The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).
telemark <- read_csv2("data/bank-full.csv")
glimpse(telemark)
## Observations: 45,211
## Variables: 17
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", …
## $ marital <chr> "married", "single", "married", "married", "single", "marri…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown",…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no"…
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unk…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "ma…
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unk…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
Column Description:
1. age: age (numeric)
2. job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
3. marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
4. education : education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
5. default: has credit in default? (binary: “yes”,“no”)
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: “yes”,“no”)
8. loan: has personal loan? (binary: “yes”,“no”)
9. contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12. duration: last contact duration, in seconds (numeric)
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
17. y: has the client subscribed a term deposit? (binary: “yes”,“no”)
Missing values (NA) are a common problem in datasets, and there are several ways to handle them. Several references state that there is no single best way to deal with missing data, so before going forward to the next step, we should check for missing values.
table(is.na(telemark))
##
## FALSE
## 768587
The data has no missing values, so no missing-value treatment is needed.
telemark <- telemark %>%
mutate(job = as.factor(job),
marital = as.factor(marital),
education = as.factor(education),
default = as.factor(default),
housing = as.factor(housing),
loan = as.factor(loan),
contact = as.factor(contact),
month = as.factor(month),
poutcome = as.factor(poutcome),
subscribe = as.factor(y)) %>%
select(-c(y))
summary(telemark)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome subscribe
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
show_plot(inspect_cor(subset(telemark, select = -c(subscribe))))
ggcorr(telemark, label = T)
## Warning in ggcorr(telemark, label = T): data in column(s) 'job', 'marital',
## 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome',
## 'subscribe' are not numeric and were ignored
numericCols <- unlist(lapply(telemark, is.numeric))
show_plot(inspect_num(telemark[,numericCols]))
prop.table(table(telemark$subscribe))
##
## no yes
## 0.8830152 0.1169848
set.seed(1)
split <- initial_split(data = telemark, prop = 0.8, strata = subscribe)
telemark_train <- training(split)
telemark_test <- testing(split)
prop.table(table(telemark_train$subscribe))
##
## no yes
## 0.8832426 0.1167574
# telemark_train_upsample <- upSample(x = telemark_train[, -17], y = telemark_train$subscribe, yname = "subscribe")
telemark_train_upsample <- SMOTE(subscribe ~ ., as.data.frame(telemark_train), perc.over = 100, perc.under = 200)
prop.table(table(telemark_train_upsample$subscribe))
##
## no yes
## 0.5 0.5
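The `perc.over` and `perc.under` arguments control how much the minority class is oversampled and how many majority cases are retained. Assuming the DMwR implementation of SMOTE, the counts behind the 50/50 result above can be reasoned out as follows:

```r
# perc.over = 100: one synthetic "yes" is generated per original "yes",
# doubling the minority class.
# perc.under = 200: two "no" cases are sampled per synthetic "yes",
# which here equals the doubled minority count, hence a balanced set.
n_yes       <- sum(telemark_train$subscribe == "yes")
n_synthetic <- n_yes * 100 / 100
n_no_kept   <- n_synthetic * 200 / 100
c(yes = n_yes + n_synthetic, no = n_no_kept)
```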
model_naive <- naiveBayes(subscribe ~ ., data = telemark_train_upsample)
naive_prediction <- predict(model_naive, telemark_test)
naive_prediction_raw <- as.data.frame(predict(model_naive, telemark_test, type = "raw"))
naive_prediction_raw <- naive_prediction_raw %>%
mutate(no = round(no,4),
yes = round(yes,4))
naive_matrix <- confusionMatrix(naive_prediction, telemark_test$subscribe, positive = "yes")
table <- as.table(naive_matrix)
table <- as.data.frame(table)
table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), fontface = "bold", color = "white") +
theme_minimal() +
theme(legend.position = "none")
naive_matrix <- matrix_result(naive_matrix, "Naive Bayes")
naive_matrix
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Naive Bayes 0.7614466 0.8095685 0.755015 0.3063543
# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
trueclass = as.numeric(telemark_test$subscribe=="yes"))
head(naive_roc)
## prediction trueclass
## 1 0.1772 0
## 2 0.4306 0
## 3 0.1510 0
## 4 0.1432 0
## 5 0.0785 0
## 6 0.0646 0
naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_n <- performance(naive_roc_pred, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.8456519
# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = naive_prediction_raw$yes,
ref = as.factor(ifelse(telemark_test$subscribe == "yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
mutate(label = as.factor(ifelse(yes >= 0.65, "yes", "no"))) %>%
select(label)
naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, telemark_test$subscribe, positive = "yes")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")
naive_matrix_tuning
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Naive Bayes Tuning 0.9200398 0.743344 1 1
model_dtree <- ctree(subscribe ~ ., telemark_train_upsample)
dtree_prediction <- predict(model_dtree, telemark_test)
dtree_matrix <- confusionMatrix(dtree_prediction, telemark_test$subscribe, positive = "yes")
dtree_matrix <- matrix_result(dtree_matrix, "Decision Tree")
dtree_matrix
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Decision Tree 0.8310108 0.8414634 0.8296138 0.3976064
model_dtree_tuning <- ctree(subscribe ~ ., telemark_train_upsample,
control = ctree_control(mincriterion = 0.1, minsplit = 100, minbucket = 60))
dtree_prediction_tuning <- predict(model_dtree_tuning, telemark_test)
dtree_matrix_tuning <- confusionMatrix(dtree_prediction_tuning, telemark_test$subscribe)
dtree_matrix_tuning <- matrix_result(dtree_matrix_tuning, "Decision Tree Tuning")
dtree_matrix_tuning
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Decision Tree Tuning 0.8462729 0.8528084 0.7973734 0.969222
ctrl <- trainControl(method = "repeatedcv", number = 5,repeats = 3)
# model_rforest <- train(subscribe ~ ., data = telemark_train_upsample, method = "rf", trControl = ctrl, ntree = 100)
# saveRDS(model_rforest, file = "model_rforest.RDS")
model_rforest <- readRDS("model_rforest.RDS")
rforest_predict <- predict(model_rforest, telemark_test)
rforest_predict_raw <- predict(model_rforest, telemark_test, type = "prob")
plot(model_rforest$finalModel)
legend("topright", colnames(model_rforest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
rforest_matrix <- confusionMatrix(rforest_predict, telemark_test$subscribe, positive = "yes")
table <- as.table(rforest_matrix)
table <- as.data.frame(table)
table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), fontface = "bold", color = "white") +
theme_minimal() +
theme(legend.position = "none")
rforest_matrix <- matrix_result(rforest_matrix, "Random Forest")
rforest_matrix
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Random Forest 0.858549 0.8405253 0.8609579 0.4468828
# ROC
forest_roc <- data.frame(prediction = rforest_predict_raw[,2],
trueclass = as.numeric(telemark_test$subscribe=="yes"))
head(forest_roc)
## prediction trueclass
## 1 0.05 0
## 2 0.31 0
## 3 0.00 0
## 4 0.00 0
## 5 0.00 0
## 6 0.02 0
forest_roc_prediction <- prediction(forest_roc$prediction, forest_roc$trueclass)
# ROC curve
plot(performance(forest_roc_prediction, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_rf <- performance(forest_roc_prediction, measure = "auc")
auc_ROCR_rf <- auc_ROCR_rf@y.values[[1]]
auc_ROCR_rf
## [1] 0.9214208
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = rforest_predict_raw$yes,
ref = as.factor(ifelse(telemark_test$subscribe == "yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Performance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
#Tuning Threshold
rforest_predict_tuning <- rforest_predict_raw %>%
mutate(label = as.factor(ifelse(yes >= 0.52, "yes", "no"))) %>%
select(label)
rforest_matrix_tuning <- confusionMatrix(rforest_predict_tuning$label, telemark_test$subscribe, positive = "yes")
rforest_matrix_tuning <- matrix_result(rforest_matrix_tuning, "Random Forest Tuning")
rforest_matrix_tuning
## Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Random Forest Tuning 0.9200398 0.743344 1 1
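Finally, the one-row summaries produced by `matrix_result()` can be stacked for a side-by-side comparison of every model evaluated above (a sketch using the objects created in the previous sections):

```r
# Combine the per-model summary rows into a single comparison table.
rbind(naive_matrix,
      dtree_matrix,
      dtree_matrix_tuning,
      rforest_matrix,
      rforest_matrix_tuning)
```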